Add DType::Union and UnionVariants#7890
Conversation
d02a438 to
1fa4a6d
Compare
1fa4a6d to
ad61b12
Compare
Merging this PR will not alter performance
|
4e8802c to
4e86fc3
Compare
| DType::Union(variants, _) => variants | ||
| .variants() | ||
| .all(|variant| supports_uncompressed_size_in_bytes(&variant)), |
There was a problem hiding this comment.
Happy to remove this until the actual uncompressed_size_in_bytes is implemented later?
| // TODO(connor): This seems wrong... | ||
| DType::FixedSizeList(_, list_size, _) => value.as_list().len() == *list_size as usize, | ||
| // TODO(connor): This also seems wrong... | ||
| DType::Struct(struct_fields, _) => value.as_list().len() == struct_fields.nfields(), |
|
I think my local version of |
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
4e86fc3 to
bd308d7
Compare
| /// Per Arrow's spec, the per-row type tag is an `int8`. By default, tag `i` selects the child at | ||
| /// offset `i` (`type_ids = [0, 1, ..., N-1]`). Schemas may also use non-consecutive tags (e.g. | ||
| /// `[0, 5, 7]`), in which case the value of `type_ids[i]` is the tag used in the data to select the | ||
| /// child at offset `i`. Supporting non-consecutive tags from v1 lets the schema remove children | ||
| /// without renumbering the remaining tags. |
There was a problem hiding this comment.
why i8 not u8?
Its worth explaining why we have type_ids and non-consecutive tags too?
There was a problem hiding this comment.
Arrow uses i8 and the maximum number of variants is 128 so this is fine imo
| /// Variant names must be distinct. Unlike [`StructFields`](crate::dtype::StructFields), which | ||
| /// permits duplicate field names for Arrow/Parquet round-trip fidelity, duplicates have no | ||
| /// meaningful semantics in a sum type and are rejected at construction. | ||
| /// |
There was a problem hiding this comment.
so we cannot round trip arrow?
If you use index access not name then it does make sense to have duplicate fields. Fields are just metadata in arrow.
| /// // Accessing a field by name will yield the first | ||
| /// assert_eq!(fields.field("int_col").unwrap(), DType::Primitive(PType::I32, Nullability::Nullable)); | ||
| /// ``` | ||
| #[allow( |
There was a problem hiding this comment.
expect doesnt work here (for some reason, I have no idea)
|
@joseph-isaacs wants to split this out, so I will remove |
Summary
Tracking issue: #7882
Adds the scaffolding and boilerplate for the new logical
DType::Uniontype.This is the first PR of probably a bunch, so it simply adds the
DType::Unionvariant and theUnionVariantstype thatDType::Unionholds. Other than that, it is mostly just proto and flatbuffers code and other boilerplate that is easy to implement.A lot of the code is essentially duplicated from implementations of
Struct. It could have been nice if we could reuseStructFields, but we need to additionally store a type IDs field that maps the different variant types to tag indexes (i.e. 0, 1, 2, 5), so we might as well have a new type for this that is named more appropriately.Also note that the tags do not have to be consecutive (to support potential schema evolution, and also it does make things like casts easier).
I've also enforced that variant names are distinct. The Arrow spec does not make any guarantee about this, but I've chosen the sane path that DuckDB also chooses.
Finally, I reordered some of the dtype match statements so that struct and union are always next to each other after list and fsl. This did cause a bit of extra churn... but with so many types now I think it is important that similar things are next to each other.
Edit: There's 1000+ lines of flatbuffers diff because I think my local
flatcinstall is different from the one on CI?API Changes
Adds
Unionvariant toDTypeTesting
Basic testing (mostly for the serialization roundtrips).